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MIXTURE MODELS WITH A PRIOR ON THE NUMBER OF 

COMPONENTS 

JEFFREY W. MILLER AND MATTHEW T. HARRISON 


Abstract. A natural Bayesian approach for mixture models with an unknown number of 
components is to take the usual finite mixture model with Dirichlet weights, and put a prior 
on the number of components—that is, to use a mixture of finite mixtures (MFM). While 
inference in MFMs can be done with methods such as reversible jump Markov chain Monte 
Carlo, it is much more common to use Dirichlet process mixture (DPM) models because of 
the relative ease and generality with which DPM samplers can be applied. In this paper, we 
show that, in fact, many of the attractive mathematical properties of DPMs are also exhib¬ 
ited by MFMs—a simple exchangeable partition distribution, restaurant process, random 
measure representation, and in certain cases, a stick-breaking representation. Consequently, 
the powerful methods developed for inference in DPMs can be directly applied to MFMs as 
well. We illustrate with simulated and real data, including high-dimensional gene expression 
data. 


1. Introduction 

Mixture models are used in a wide range of applications, including population structure 
(Pritchard et ah, 2000), document modeling (Blei et ah, 2003), speaker recognition (Reynolds 
et ah, 2000), computer vision (Stauffer and Grimson, 1999), phylogenetics (Pagel and Meade, 
2004), and gene expression profiling (Yeung et ah, 2001), to name a few prominent examples. 
A common issue with finite mixtures is that it can be difficult to choose an appropriate 
number of mixture components, and many methods have been proposed for making this 
choice (e.g., Henna, 1985; Keribin, 2000; Leroux, 1992; Ishwaran et ah, 2001; James et ah, 
2001 ). 

From a Bayesian perspective, perhaps the most natural approach is to treat the number 
of components like any other unknown parameter and put a prior on it. For short, we refer 
to such a model as a mixture of finite mixtures (MFM). Several inference methods have 
been proposed for this type of model (Nobile, 1994; Phillips and Smith, 1996; Richardson 
and Green, 1997; Stephens, 2000; Nobile and Fearnside, 2007), the most commonly-used 
method being reversible jump Markov chain Monte Carlo (Green, 1995; Richardson and 
Green, 1997). Reversible jump is a very general technique, and has been successfully applied 
in many contexts, but it is perceived to be difficult to use, and applying it to new situations 
requires one to design good reversible jump moves, which can be nontrivial, particularly in 
high-dimensional parameter spaces. 

Meanwhile, infinite mixture models such as Dirichlet process mixtures (DPMs) have be¬ 
come popular, partly due to the existence of generic Markov chain Monte Carlo (MCMC) 
algorithms that can easily be adapted to new applications (Neal, 1992, 2000; MacEach- 
ern, 1994, 1998; MacEachern and Muller, 1998; Bush and MacEachern, 1996; West, 1992; 
West et ah, 1994; Escobar and West, 1995; Liu, 1994; Dahl, 2003, 2005; Jain and Neal, 
2004, 2007). These algorithms are made possible by the fact that the Dirichlet process has 
a variety of elegant mathematical properties—an exchangeable partition distribution, the 
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Blackwell-MacQueen urn process (a.k.a. the Chinese restaurant process), a random discrete 
measure formulation, and the Sethuraman-Tiwari stick-breaking representation (Ferguson, 
1973; Antoniak, 1974; Blackwell and MacQueen, 1973; Aldous, 1985; Pitman, 1995, 1996; 
Sethuraman, 1994; Sethuraman and Tiwari, 1981). 

The purpose of this paper is to show that in fact, MFMs typically exhibit many of 
these same appealing properties—an exchangeable partition distribution, urn/restaurant 
process, random discrete measure formulation, and in certain cases, a simple stick-breaking 
representation—and consequently, that many of the inference techniques developed for DPMs 
can be directly applied to MFMs. In particular, these properties enable one to do inference 
in MFMs without using reversible jump. Interestingly, the key properties of MFMs hold for 
any choice of prior distribution on the number of components. 

There has been a large amount of research on efficient inference methods for DPMs, and 
an immediate consequence of the present work is that most of these methods can also be used 
for MFMs. Since many DPM sampling algorithms (for both conjugate and non-conjugate 
priors) are designed to have good mixing properties across a wide range of applications— 
for instance, the Jain-Neal split-merge samplers (Jain and Neal, 2004, 2007), coupled with 
incremental Gibbs moves (MacEachern, 1994; Neal, 1992, 2000)— this greatly simplifies the 
use of MFMs in new applications. 

This work resolves an open problem discussed by Green and Richardson (2001), who noted 
that it would be interesting to be able to apply DPM samplers to MFMs: 

“In view of the intimate correspondence between DP and [MFM] models dis¬ 
cussed above, it is interesting to examine the possibilities of using either class 
of MCMC methods for the other model class. We have been unsuccessful in 
our search for incremental Gibbs samplers for the [MFM] models, but it turns 
out to be reasonably straightforward to implement reversible jump split/merge 
methods for DP models. ” 

The paper is organized as follows. In the remainder of this section, we motivate this work 
with an overview of the similarities and differences between MFMs and DPMs, illustrated 
by a simulation example. In Sections 2 and 3, we formally define the MFM and show that 
it gives rise to a simple exchangeable partition distribution closely paralleling that of the 
Dirichlet process. In Section 4, we present the Polya urn scheme (restaurant process), random 
discrete measure formulation, and stick-breaking representation for the MFM. In Section 5, 
we establish some asymptotic results for MFMs. In Section 6, we show how the properties 
in Sections 3 and 4 lead to efficient inference algorithms for the MFM, and in Section 7, 
we apply the model to the galaxy dataset (a standard benchmark) and to high-dimensional 
gene expression data used to discriminate cancer subtypes. We close with a brief discussion. 

1.1. Background: Similarities and differences between MFMs and DPMs. In many 
respects, MFMs and DPMs are quite similar, but there are some important differences. In 
general, we would not say that one model is uniformly better than the other; rather, one 
should choose the model which is best suited to the application at hand. 

Density estimation. For certain nonparametric density estimation problems, both models 
have been shown to exhibit posterior consistency at the minimax optimal rate, up to loga¬ 
rithmic factors (Kruijer et al., 2010; Ghosal and Van der Vaart, 2007). Even for small sample 
sizes, we observe empirically that density estimates under the two models are remarkably 
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FIGURE 1. Density estimates for MFM (left) and DPM (right) on increasing 
amounts of data from a three-component Gaussian mixture (bottom). As 
n increases, the estimates appear to be converging to the true density, as 
expected. See Section 7.1 for details. 
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Figure 2. Hellinger distance to the true density, for MFM (red, left) and 
DPM (blue, right) density estimates, on data from a three-component Gauss¬ 
ian mixture. For each n G {50,100,250,1000}, five independent datasets of 
size n were used, and the lines connect the averages of the distances for each 
n. See Section 7.1 for details. 


similar. As a toy example, Figure 1 compares their density estimates on data from a bi¬ 
variate Gaussian mixture with three components. As the amount of data increases, these 
density estimates appear to be converging to the true density, as expected; indeed, Figure 2 
indicates that the Hellinger distance to the true density is going to zero. 


Clustering. It seems that more often, mixture models are used for clustering and latent class 
discovery rather than density estimation. MFMs have a partition distribution that takes 
a very similar form to that of the Dirichlet process; see Section 3. However, despite this 
similarity, the MFM partition distribution differs in two fundamental respects. While the 
first is widely known, the second is far less often appreciated, and yet is perhaps even more 
important. 

(1) The prior on the number of clusters t is very different. In an MFM, one has complete 
control over the prior on the number of components k, which in turn provides control 
over the prior on t. As the sample size n grows, in an MFM the prior on t converges 
to the prior on k (in fact, t converges to k almost surely). In contrast, in a Dirichlet 
process, the prior on t takes a particular parametric form and diverges at a logn rate. 

(2) Given t, the prior on the cluster sizes is very different. In an MFM, most of the 
prior mass is on partitions in which the sizes of the clusters are all the same order 
of magnitude, while in a Dirichlet process, most of the prior mass is on partitions in 
which the sizes vary widely, with a few large clusters and many very small clusters. 

See Section 5 for more precise descriptions of (1) and (2) in mathematical terms. (Note 
that while some authors use the terms “cluster” and “component” interchangeably, we use 
cluster to refer to a group of data points, and component to refer to one of the probability 
distributions in a mixture model.) 
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FIGURE 3. Typical sample clusterings from the posterior for the MFM (left) 
and DPM (right), on n = 250 data points from a three-component Gaussian 
mixture; the bottom plot shows the true component assignments. Note the 
small extra clusters in the DPM samples (red squares). Best viewed in color. 
See Section 7.1 for details. 
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t (number of clusters) t (number of clusters) 



k (number of components) 


FIGURE 4. Posterior on the number of clusters t for the MFM (top left) 
and DPM (top right), and the posterior on the number of components k for 
the MFM (bottom), on increasing amounts of data from a three-component 
Gaussian mixture. See Section 7.1 for details. 


These prior differences carry over to noticeably different posterior clustering behavior. For 
instance, Figure 3 displays typical clusterings sampled from the posterior, illustrating that 
DPM samples tend to have tiny “extra” clusters, while MFM samples do not. 

ft should also be mentioned that due to (2), MFMs “dislike” partitions with very small 
clusters, causing incremental Gibbs samplers to mix more slowly (empirically) when n is 
large, however, this is easily remedied by using split-merge samplers (Jain and Neal, 2004, 
2007); see Section 6. 

Mixing distribution and the number of components. Assuming that the data is from a finite 
mixture, it is also sometimes of interest to infer the mixing distribution or the number of 
components, subject to the caveat that these inferences are meaningful only to the extent 
that the component distributions are correctly specified and the model is mixture identifiable. 
While Nguyen (2013) has shown that under certain conditions, DPMs are consistent for the 
mixing distribution (in the Wasserstein metric), Miller and Harrison (2014) have shown that 
the posterior on the number of clusters in a DPM is typically not consistent for the number 
of components. On the other hand, MFMs are consistent for the mixing distribution and 
the number of components (for Lebesgue almost-all parameter values) under very general 
conditions; this is a straightforward consequence of Doob’s theorem (Nobile, 1994). The 
relative ease with which this consistency can be established for MFMs is due to the fact that 
in an MFM, the parameter space is a countable union of finite-dimensional spaces, rather 
than an infinite-dimensional space. 

These consistency/inconsistency properties are readily observed empirically—they are not 
simply large-sample phenomena. As seen in Figure 4, the tendency of DPM samples to have 
tiny extra clusters causes the number of clusters t to be somewhat inflated, apparently making 
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the DPM posterior on t fail to concentrate, while the MFM posterior on t concentrates at the 
true value (see Section 5.2). In addition to the number of clusters t, the MFM also permits 
inference for the number of components k in a natural way (Figure 4), while in the DPM 
the number of components is always infinite. See Section 8 for discussion regarding issues 
with estimating the number of components. 

2. Model 

We consider the following well-known model: 

K ~ p K , where px is a p.m.f. on {1,2,...} 

(tti, ..., 7 r fe ) ~ Dirichlet fc ( 7 ,..., 7 ), given K = k 

( 2 . 1 ) Z u ... ,Z n ~ 7 T, given tt 

9i ,..., Ok ~ H, given K = k 

Xj ~ fg z independently for j — 1,..., n, given Qi.x, Z\. n . 

Here, H is a prior or “base measure” on 0 C M £ , and {fg : 0 G 0} is a family of probability 
densities with respect to a sigma-finite measure f on X C R d . (As usual, we give 0 and 
X the Borel sigma-algebra, and assume (x, 6) 1 —> fg(x) is measurable.) We denote x 1:n = 
(xi,... , x n ). Typically, the values X 1} ..., X n would be observed, and all other variables 
would be hidden/latent. We refer to this as a mixture of finite mixtures (MFM) model. 

It is important to note that we assume a symmetric Dirichlet with a single parameter 
7 not depending on k. This assumption is key to deriving a simple form for the partition 
distribution and the other resulting properties. Assuming symmetry in the distribution of 7 r 
is quite natural, since the distribution of Xi ,..., X n under any asymmetric distribution on 
7 r would be the same as if this were replaced by its symmetrized version, i.e., if the entries 
of 7 r were uniformly permuted (although this would no longer necessarily be a Dirichlet 
distribution). Assuming the same 7 for all k is a genuine restriction, albeit a fairly natural 
one, often made in such models even when not strictly necessary (Nobile, 1994; Phillips and 
Smith, 1996; Richardson and Green, 1997; Green and Richardson, 2001; Stephens, 2000; 
Nobile and Fearnside, 2007). Note that prior information about the relative sizes of the 
mixing weights 717 ,... ,717 can be introduced through 7 —roughly speaking, small 7 favors 
lower entropy 7r’s, while large 7 favors higher entropy 7r’s. 

Meanwhile, we put very few restrictions on px, the distribution of the number of com¬ 
ponents. For practical purposes, we need the infinite series Y^T=iPk{^) f° converge to 1 
reasonably quickly, but any choice of px arising in practice should not be a problem. For 
certain theoretical purposes—in particular, consistency for the number of components—it is 
desirable to have pxifi) > 0 for all k G {1, 2,... }. 

For comparison, the Dirichlet process mixture (DPM) model with concentration parameter 
a > 0 and base measure H is defined as follows, using Sethuraman’s (1994) representation: 

Bi, B 2l ... ~ Beta(l, a) 

Zi,...,Z n ~ 7 r, given 7r = ( 711 , 7 r 2 ,...) where 7 r* = B, ET/Iit 1 ~ B j) 

0 1 ,0 2 ,... i ^H 

Xj ~ fg z , independently for j = 1,..., n, given 0 1:OOl Z 1:n . 
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3. Exchangeable partition distribution 

The primary observation on which our development relies is that the distribution on 
partitions induced by an MFM takes a form which is simple enough that it can be easily 
computed. Let C denote the unordered partition of [n] := {1,..., n} induced by Z \,..., Z n \ 
in other words, C = {Ei : \Ei\ > 0} where Ei = {j : Zj = i} for i G {1, 2,... 

Theorem 3.1. Under the MFM (Equation 2.1), the probability mass function of C is 

(3.i) p(o=K W n# 

ceC 

where t — \C\ is the number of parts in the partition, and 

OO 

P- 2 ) = 

All proofs have been collected in Appendix B. Here, x ^ = x(x + 1) • • • (x + m — 1) 
and X( m ) = x(x — 1) • • • (x — m + 1), with x= 1 and X(o) = 1 by convention. We 
discuss computation of V n (t) in Section 3.2. For comparison, under the DPM, the partition 
distribution induced by Z 1 ,..., Z n is p DPM (C) = rTceC(I c l — !)• (Antoniak, 1974). 

Viewed as a function of the part sizes (|c| : c G C ), Equation 3.1 is an exchangeable 
partition probability function (EPPF) in the terminology of Pitman (1995, 2006), since it is 
a symmetric function of the part sizes. Consequently, C is an exchangeable random partition 
of [n]; that is, its distribution is invariant under permutations of [n] (alternatively, this can 
be seen directly from the definition of the model, since Z\, ..., Z n are exchangeable). 

More specifically, we observe that Equation 3.1 is a member of the family of Gibbs partition 
distributions (Pitman, 2006); this is also implied by the results of Gnedin and Pitman (2006) 
characterizing the extreme points of the space of Gibbs partition distributions. Further 
results on Gibbs partitions are provided by Ho et al. (2007), Lijoi et al. (2008), Cerquetti 
(2008, 2011), Gnedin (2010), and Lijoi and Priinster (2010). However, the utility of this 
representation for inference in mixture models with a prior on the number of components 
does not seem to have been previously explored in the literature. 

Due to Theorem 3.1, we have the following equivalent representation of the model: 

C ~ p[C), with p(C) as in Equation 3.1 

(3.3) (j) c ~ H for c G C, given C 

Xj rs-/ f^ c independently for j G c, c G C, given 0, C, 

where 0 = (0 C : c G C) is a tuple of t — \C\ parameters 0 C G 0, one for each part cGC. 

This representation is particularly useful for doing inference, since one does not have to 
deal with cluster labels or empty components. The formulation of models starting from 
a partition distribution has been a fruitful approach, exemplified by the development of 
product partition models (Hartigan, 1990; Barry and Hartigan, 1992; Quintana and Iglesias, 
2003; Dahl, 2009; Park and Dunson, 2010; Muller and Quintana, 2010; Muller et ah, 2011). 

3.1. Basic properties. We list here some basic properties of the MFM model. See Appen¬ 
dix B for proofs. Denoting x c = (xj : j G c) and m(x c ) = f e [ Y\ ]( z c fo{ x j)\ H{dff) (with the 
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convention that m{x 0 ) = 1), we have 

(3.4) p(xi:n\C) = \\m{x c ). 

ceC 


The number of components K and the number of clusters T — \C\ are related by 


< 3 - 5 ) P(W = £ IP (I 7 

W ’ C:\C\=tc£C 

AO pm = 

where in Equation 3.5, the sum is over partitions C of [n] such that \C\ = t. The formula for 
p(k\t) is required for doing inference about the number of components K based on posterior 
samples of C\ fortunately, it is easy to compute. We have the conditional independence 
relations 


(3.7) C±K\ T, 

(3.8) X 1:n ±K\T. 


3.2. The coefficients V n (t). The following recursion is a special case of a more general 
result for Gibbs partitions (Gnedin and Pitman, 2006). 

Proposition 3.2. The numbers V n (t) (Equation 3.2) satisfy the recursion 

(3.9) K+i (t + 1) = V n (t)/ 7 - [nj 7 + t)V n+1 (t) 
for any 0 <t<n and 7 > 0. 

This is easily seen by plugging the identity 

k(t+ 1 ) = ( 7 k + n)k {t) /7 - (n /7 + t)k (t) 

into the expression for V n+1 (t + 1). In the case of 7 = 1, Gnedin (2010) has discovered a 
beautiful example of a distribution on K for which both px{k ) and V n (t) have closed-form 
expressions. 

In previous work on the MFM model, it has been common for px to be chosen to be 
proportional to a Poisson distribution restricted to the positive integers or a subset thereof 
(Phillips and Smith, 1996; Stephens, 2000; Nobile and Fearnside, 2007), and Nobile (2005) 
has proposed a theoretical justification for this choice. Interestingly, the model has some nice 
mathematical properties if one instead chooses K — 1 to be given a Poisson distribution, that 
is, px{k) = Poisson(fc — 11A) for some A > 0. One example of this arises here (for another 
example, see Section 4.3): it turns out that if Px(k) = Poisson(h — 11A) and 7 = 1 then 

(3.10) K(°) = Wi-£>kM). 

k =1 

However, to do inference, it is not necessary to choose pj< to have any particular form. To 
do inference, we just need to be able to compute p(C), and in turn, we need to be able to 
compute V n (t). To this end, note that k^/f^k)^ < k t /( , yk) n , and thus the infinite series for 
V n (t) converges rapidly when t <C n. It always converges to a finite value when 1 < t < n; 
this is clear from the fact that p(C) G [0,1]. This finiteness can also be seen directly from the 
series since k t f[p{kf a < 1 / 7 ", and in fact, this shows that the series for V n (t) converges at 
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least as rapidly (up to a constant) as the series YlkLiPxik) converges to 1. Hence, for any 
reasonable choice of px (he., not having an extraordinarily heavy tail), V n (t) can easily be 
numerically approximated to a high level of precision. In practice, computing the required 
values of V n (t) takes a negligible amount of time. 

3.3. Self-consistent marginals. For each n — 1,2,..., let q n (C) denote the distribution 
on partitions of [n] as defined above (Equation 3.1). This family of partition distributions is 
preserved under marginalization, in the following sense. 

Proposition 3.3. If m < n then q m coincides with the marginal distribution on partitions 
of [m\ induced by q n . 

In other words, drawing a sample from q n and removing elements m + 1 ,,n from it 
yields a sample from q m . This can be seen directly from the model definition (Equation 2.1), 
since C is the partition induced by the Z’s, and the distribution of Z\ :m is the same when 
the model is defined with any n > m. This property is sometimes referred to as consistency 
in distribution (Pitman, 2006). 

By Kolmogorov’s extension theorem (e.g., Durrett, 1996), it is well-known that this implies 
the existence of a unique probability distribution on partitions of the positive integers Z >0 = 
{1,2,...} such that the marginal distribution on partitions of [n] is q n for all n G {1, 2,... }. 
A random partition of Z >0 from such a distribution is a combinatorial stochastic process ; for 
background, see Pitman (2006). 


4. Restaurant process, stick-breaking, and random measure 

REPRESENTATIONS 


4.1. Polya urn scheme / Restaurant process. Pitman (1996) considered a general class 
of urn schemes, or restaurant processes, corresponding to exchangeable partition probability 
functions (EPPFs). The following scheme for the MFM falls into this general class. 


Theorem 4.1. The following process generates partitions Ci,C 2 , • • • such that for any n G 
{1, 2,...}, the probability mass function of C n is given by Equation 3.1. 


Initialize with a single cluster consisting of element 1 alone: C\ = {{1}}. 
For n = 2,3,..., element n is placed in ... 

an existing cluster c G C n _i with probability cx |c| + 7 

a new cluster with probability cx ^ 7 


where t = |C n -i| ■ 


V n (t) 


Clearly, this bears a close resemblance to the Chinese restaurant process (i.e., the 
Blackwcll-MacQueen urn process), in which the nth element is placed in an existing cluster 
c with probability oc |c| or a new cluster with probability cx a (the concentration parameter) 
(Blackwell and MacQueen, 1973; Aldous, 1985). 


4.2. Random discrete measures. The MFM can also be formulated starting from a dis¬ 
tribution on discrete measures that is analogous to the Dirichlet process. With K , 7 r, and 
6\:k as in Equation 2.1, let 
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where 5g is the unit point mass at 0. Let us denote the distribution of G by Xi(pK,'y, H). 
Note that G is a random discrete measure over 0. If H is continuous (i.e., H({9}) = 0 for 
all 6 G 0 ), then with probability 1 , the number of atoms is K ; otherwise, there may be fewer 
than K atoms. If we take ... ,X n \G i.i.d. from the resulting mixture, namely, 

K 

fa(x) ■= / fg(x)G(dd) = Tif 0i {x), 

J i= 1 

then the distribution of X X:n is the same as before. So, in this notation, the MFM model is: 

G ~ M(p K , 7 , H) 

Xi ,..., X n ~ fa, given G. 

This random discrete measure perspective is connected to work on species sampling models 
(Pitman, 1996) in the following way. When H is continuous, we can construct a species 
sampling model by letting G ~ M(pk, 7 , H) and modeling the observed data as /?i,..., (3 n ~ 
G. We refer to Pitman (1996), Hansen and Pitman (2000), Ishwaran and James (2003), 
Lijoi et al. (2005, 2007), and Lijoi et al. (2008) for more background on species sampling 
models and further examples. The following posterior predictive rule for this particular 
model follows the form of Pitman’s general rule; note the close relationship to the restaurant 
process (Theorem 4.1 above). 

Theorem 4.2. If H is continuous, then f3 x ~ H and the distribution of (3 n given ft i,..., 0 n -1 
is proportional to 

(4-1) Vn ftf > '<H + Y i (n i + 1 )S K , 

' ' i= 1 

where f3\, ... ,/3f are the distinct values taken by (3 X ,, f3 n -i, an d rii = 6 [n — 1 ] : / 3j = 

ft*}- 

For comparison, when G ~ DP (aH) instead, the distribution of [3 n given f3i ,..., fi n -\ is 
proportional to aH+J2"jZl dg :i = n iP*, since G\/3 X ,..., /3 n -1 ~ DP(aH+JZ'j=i dg :i ) 

(Ferguson, 1973; Blackwell and MacQueen, 1973). 

4.3. Stick-breaking representation. The Dirichlet process has an elegant stick-breaking 
representation for the mixture weights 7 Ti, 7 T 2 ,... (Sethuraman, 1994; Sethuraman and Ti- 
wari, 1981). This extraordinarily clarifying perspective has inspired a number of nonpara- 
metric models (MacEachern, 1999, 2000; Hjort, 2000; Ishwaran and Zarepour, 2000; Ishwaran 
and James, 2001; Griffin and Steel, 2006; Dunson and Park, 2008; Chung and Dunson, 2009; 
Rodriguez and Dunson, 2011; Broderick et ah, 2012), has provided insight into the prop¬ 
erties of related models (Favaro et ah, 2012; Teh et ah, 2007; Thibaux and Jordan, 2007; 
Paisley et ah, 2010), and has been used to develop efficient inference algorithms (Ishwaran 
and James, 2001; Blci and Jordan, 2006; Papaspiliopoulos and Roberts, 2008; Walker, 2007; 
Kalli et ah, 2011). 

In a certain special case—namely, when pK(k) = Poisson(fc — 11 A) and 7 = 1—we have 
noticed that the MFM also has an interesting representation that we describe using the 
stick-breaking analogy, although it is somewhat different in nature. This is another example 
of the nice mathematical properties resulting from this choice of px and 7 . Consider the 
following procedure: 
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Take a unit-length stick, and break off i.i.d. Exponential(A) pieces until you 
run out of stick. 

In other words, let € 1 , 62 , ■■■ ~ Exponential A), define K = min{j : e * — 1}; an d se ^ 

TTi = €i for i = 1 ,..., K - 1 and = 1 - Ya=i ^i- 

Proposition 4.3. The stick lengths tt have the same distribution as the mixture weights tt 
in the MFM model when px{k ) = Poisson(/c — 11 A) and 7 = 1. 

This is a consequence of a standard construction for Poisson processes. This suggests a 
way of generalizing the MFM model: take any sequence of nonnegative random variables 
(ei,e 2 ,...) (not necessarily independent or identically distributed) such that Y^=i e i > 1 
with probability 1, and define K and it as above. Although the distribution of K and 7r 
may be complicated, in some cases it might still be possible to do inference based on the 
stick-breaking representation. This might be an interesting way to introduce different kinds 
of prior information on the mixture weights, however, we have not explored this possibility. 

5. Asymptotics 

In this section, we consider the asymptotics of V n (t), the asymptotic relationship between 
the number of components and the number of clusters, and the approximate form of the 
conditional distribution on cluster sizes given the number of clusters. 

5.1. Asymptotics of V n {t). Recall that V n (t) = Ysk=i Pi<{k) (Equation 3.2) for 1 < 
t < n, with 7 > 0 and px a p.m.f. on {1, 2 ,... }. 

Theorem 5.1. For any t G {1, 2,... }, if px(t) > 0 then 

t 5 - 1 ) ~ (77> «T) ~ I ^ p*(«) 

as n —* 00. 

In particular, V n (t) has a simple interpretation, asymptotically—it behaves like the k = t 
term in the series. 

5.2. Relationship between the number of clusters and number of components. In 

the MFM, it is perhaps intuitively clear that, under the prior at least, the number of clusters 
T — \C\ behaves very similarly to the number of components K when n is large. It turns 
out that under the posterior they also behave very similarly for large n. 

Theorem 5.2. Let X\,X 2 ,... G X and k 6 {1, 2,... }. If Pk{ 1), • • • ,Pi<(k) > 0 then 

| p(T = k | x 1:n ) - p(K = k | x 1:n ) | —* 0 

as n —> 00. 

5.3. Distribution of the cluster sizes under the prior. Here, we examine one of the 

major differences between the MFM and DPM priors. Roughly speaking, under the prior, the 
MFM prefers all clusters to be the same order of magnitude, while the DPM prefers having 
a few large clusters and many very small clusters. In the following calculations, we quantify 
the preceding statement more precisely. (See Green and Richardson (2001) for informal 
observations along these lines.) Interestingly, these prior influences remain visible in certain 
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aspects of the posterior, even in the limit as n goes to infinity, as shown by the inconsistency 
of DPMs for the number of components in a finite mixture (Miller and Harrison, 2014). 

Let C be the partition of [n\ in the MFM model (Equation 3.1), and let A = (Ai ,..., At) 
be the ordered partition of [n] obtained by randomly ordering the parts of C, uniformly 
among the T! possible choices, where T — \C\. Then 

' i= 1 


where t — \C\. Now, let S = (Si,..., St) be the vector of part sizes of A, that is, S) = \Ai\. 
Then 

r—_ r>! -J—f 

P (s=s)= y . pc) = 

A:S(A)=s ‘ i= 1 *' 

for s G A t , t G {l,...,n}, where A t = {s G 1} : JAsj = n, s* > lVz} (i.e., the t-part 
compositions of n). For any x > 0, writing jm\ = r(x+m) /(m! T(x)) and using Stirling’s 
approximation, we have x^/ml ~ m^ 1 /Y(x) as m —» oo. This yields the approximations 


p(S = s) 


V n (t) n\ 


n 


—T l l S 


. 7-1 


r (7)* t\ 

(using Theorem 5.1 in the second step), and 


pn(t) r(yf) -A- 7 _! 
n-rt- 1 r( 7 )* JyJ-N 


t 

p(^ = s |T = t)^KjJs7“ 1 

i =1 


for s G A t , where k is a normalization constant. Thus p(s|f), although a discrete distribution, 
has approximately the same shape as a symmetric t-dimensional Dirichlet distribution. This 
would be obvious if we were conditioning on the number of components K , and it makes 
intuitive sense when conditioning on T, since K and T are essentially the same for large n. 

It is very interesting to compare this to the corresponding distributions for Dirichlet process 
mixtures. In the DPM, we have p DPM (C) = Jhy Idcec(l c l - 1 )-’ anci Popu(A) = p DPM (C)/|C|! as 
before, so for s G A t , t G {1,..., n}, 

i q i n\ a* _j —i 
p DPU (S = s) = ^ - Sl ---St 


and 


Pdpu(S = s \T = t) oc 1 ■ • • s^, 

which has the same shape as a f-dimensional Dirichlet distribution with all the parameters 
taken to 0 (noting that this is normalizable since A t is finite). Asymptotically in n, PcpM^f) 
puts all of its mass in the “corners” of the discrete simplex A t , while under the MFM, p(s\t) 
remains more evenly dispersed. 


6. Inference algorithms 

As shown by the results of Sections 3 and 4, MFMs have many of the same properties as 
DPMs. As a result, much of the extensive body of work on MCMC samplers for DPMs can 
be directly applied to MFMs, including samplers for conjugate and non-conjugate cases, as 
well as split-merge samplers. 
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When H is a conjugate prior for {fe}, such that the marginal likelihood m(x c ) = 
/e[n,« fg( x j)] H(d9) can be easily computed, the following Gibbs sampling algorithm 
can be used to sample from the posterior on partitions, p(C\xi :n ). Given a partition C, let 
C\j denote the partition obtained by removing element j from C. 


( 1 ) 

( 2 ) 


Initialize C = {[rz]} (i.e., one cluster). 

Repeat the following N times, to obtain N samples. 

For j = 1 ,,n: Remove element j from C, and place it 

.m(x cUj ) 


in c E C\j with probability oc (|c| + 7 )- 


in a new cluster with probability oc 7 
where t — \C \ j\. 


m (x c ) 
V n (t + 1) 

V n (t) 


m 


>i) 


This is a direct adaptation of “Algorithm 3” for DPMs (MacEachern, 1994; Neal, 1992, 2000). 
The only differences are that in Algorithm 3, |c| +7 is replaced by |c|, and + 1)/I4(t) 
is replaced by a (the concentration parameter). Thus, the differences between the MFM 
and DPM versions of the algorithm are precisely the same as the differences between their 
respective restaurant processes. Computing the required values of V n {t) takes a negligible 
amount of time compared to running the sampler. In order for the algorithm to be valid, 
the Markov chain needs to be irreducible, and to achieve this it is necessary to have {t G 
{1,2,...} : V n (t) > 0} be a block of consecutive integers. I 11 fact, it turns out that this is 
always the case (and this block includes t = 1 ), since for any k such that px(k) > 0 , we have 
V n (t) > 0 for all t = l,...,k. 

When H is a non-conjugate prior (and m(x c ) cannot be easily computed), a clever auxiliary 
variable technique referred to as “Algorithm 8 ” can be used for inference in the DPM (Neal, 
2000; MacEachern and Muller, 1998). Making the same substitutions as above, we can apply 
Algorithm 8 to perform inference in the MFM as well; see Miller (2014) for details. 

A well-known issue with incremental Gibbs samplers such as these, however, when applied 
to DPMs, is that the mixing can be somewhat slow, since it may take a long time to create 
or destroy substantial clusters by moving one element at a time. With MFMs, this issue 
seems to be exacerbated, since MFMs tend to put small probability (compared with DPMs) 
on partitions with tiny clusters (see Section 5.3), making it difficult for the sampler to move 
through these regions of the space. 

To deal with this issue, split-merge samplers for DPMs have been developed, in which a 
large number of elements can be reassigned in a single move (Dahl, 2003, 2005; Jain and Neal, 
2004, 2007). In the same way as the incremental samplers, one can directly apply these split- 
merge samplers (both conjugate and non-conjugate) to MFMs, using the properties described 
in Sections 3 and 4. More generally, it seems likely that any partition-based MCMC sampler 
for DPMs could be applied to MFMs as well. 

In Section 7, we apply the Jain-Neal split-merge samplers coupled with incremental Gibbs 
samplers, in both conjugate and non-conjugate settings. 


7. Empirical demonstrations 

In this section, we demonstrate the MFM on simulated and real datasets. All of the 
examples below involve Gaussian component densities, but of course our approach is not 
limited to mixtures of Gaussians. 
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7.1. Simulation example. In the introduction, we presented several figures comparing the 
MFM and DPM on data from a three-component bivariate Gaussian mixture, illustrating 
the behavior of the MFM with respect to density estimation, clustering, and inference for 
the number of components. Here, we provide the details of the data, model, and method of 
inference for this simulation example. 


Data. The data distribution is WiJ\f(ni, Cf) where w = (0.45,0.3,0.25), /q = 
ft2 = (I), ft. = (§), C, = (J»), C 2 = R(V 0 ° 2 )R T where R = (“£ ^"/) with p = 
airdC 3 = ( g A ) - 


«), 

w/4, 


Model. The component densities are multivariate normal, fe{x) = /^(x) = A^(t|/i,A _ 1 ) 
and the base measure (prior) H on 8 = (fj,, A) is /i ~ A7(/2, C), A ~ Wishartd(V, u) in¬ 
dependently, where fi is the sample mean, C is the sample covariance, v — d — 2, and 
V = C~ x jv. Here, Wishart d (A|V, v) oc | det A|T _d_1 )/ 2 exp ( — |tr(l/ _1 A)j. Note that this is 
a data-dependent prior. 

For the MFM, we take K ~ Geometric(r) ( px(k ) = (1 — r) fe_1 r for k = 1,2,...) with 
r = 0.1, and we choose 7 = 1 for the finite-dimensional Dirichlet parameters. For the DPM, 
we put an Exponential(l) prior on the concentration parameter, a. 

Note that taking /i and A to be independent results in a non-conjugate prior. This prior 
is appropriate when the location of the data is not informative about the covariance (and 
vice versa). 


Inference. For both the MFM and DPM, we use the non-conjugate split-merge sampler 
of Jain and Neal (2007), coupled with Algorithm 8 of Neal (2000) (using a single auxiliary 
variable) for incremental Gibbs updates to the partition. Specifically, following Jain and Neal 
(2007), we use the (5,1,1,5) scheme: 5 intermediate scans to reach the split launch state, 1 
split-merge move per iteration, 1 incremental Gibbs scan per iteration, and 5 intermediate 
moves to reach the merge launch state. Gibbs updates to the DPM concentration parameter 
a are made using Metropolis-Hastings moves. 

Five independent datasets were used for each n e {50,100, 250,1000}, and for each model 
(MFM and DPM), the sampler was run for 5,000 burn-in iterations and 95,000 sample 
iterations (for a total of 100,000). Judging by traceplots and running averages of various 
statistics, this appeared to be sufficient for mixing. The cluster sizes were recorded after 
each iteration, and to reduce memory storage requirements, the full state of the chain was 
recorded only once every 100 iterations. For each run, the seed of the random number 
generator was initialized to the same value for both the MFM and DPM. 

For a dataset of size n, the sampler used for these experiments took approximately 8 x 
10~ 6 n seconds per iteration, using a 2.80 GHz processor with 6 GB of RAM. 


Results. As described in the introduction, the results of this simulation empirically indicate 
that on data from a finite mixture, MFMs and DPMs are consistent for the density (Figures 
1 and 2), DPM clusterings tend to have small extra clusters while MFM clusterings do not 
(Figure 3), and MFMs are consistent for the number of components while DPMs are not 
(Figure 4). This is what we expect from theory (although to be precise, the inconsistency 
result of Miller and Harrison (2014) only applies to the case of fixed concentration parameter 
a). These results are not too surprising, since when the data distribution is a finite mixture 
from the assumed family, the MFM is correctly specified, while the DPM is not. On data from 
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an infinite mixture, one would expect the DPM to have certain advantages. See Appendix 
A for formulas for computing the posterior on k and the density estimates. 

7.2. Galaxy dataset. The galaxy dataset (Roeder, 1990) is a standard benchmark for 
mixture models, consisting of measurements of the velocities of 82 galaxies in the Corona 
Borealis region; see Figure 5. The purpose of this example is to demonstrate agreement 
between our method and published results using reversible jump MCMC with the same 
model, and also to show that using hyperpriors presents no difficulties. 

Model. To enable comparison, we use exactly the same model as Richardson and Green 
(1997). The component densities are univariate normal, fe(x) = / Mi \(x) = J\f(x\ii, A^ 1 ), 
and the base measure H on 6 = (p, A) is /i ~ A^/io,^), A ~ Gamma(a,6) independently 
(where Gamma(A|a, b) oc A a_1 e _?lA ). Further, a hyperprior is placed on b, by taking b ~ 
Gamma(a 0 , b 0 ). The remaining parameters are set to p 0 = (niaxjxj} + min{xj})/2, a 0 = 
max{xj} — min{xj}, a = 2, a 0 = 0.2, and b 0 = lO/cg. Note that the parameters p 0 ; cr 0 , 
and b 0 are functions of the observed data x±,... ,x n . See Richardson and Green (1997) for 
the rationale behind these parameter choices. (Note: This choice of 07 may be a bit too 
large, affecting the posteriors on the number of clusters and components, however, we stick 
with it to enable comparisons to Richardson and Green (1997).) For the MFM, following 
Richardson and Green (1997), we take K ~ Uniform{l,..., 30} and 7 = 1. For the DPM, 
we take a ~ Exponential(l). 

Inference. As before, we use the non-conjugate split-merge sampler of Jain and Neal (2007) 
coupled with Algorithm 8 of Neal (2000), and Gibbs updates to the DPM concentration 
parameter a are made using Metropolis-Hastings. We use Gibbs sampling to handle the 
hyperprior on b (i.e., append b to the state of the Markov chain, run the sampler given b as 
usual, and periodically sample b given everything else). More general hyperprior structures 
can be handled similarly. In all other respects, the same inference algorithm as in Section 
7.1 was used. 

We do not restrict the parameter space in any way (e.g., forcing the component means 
to be ordered to obtain identihability, as was done by Richardson and Green (1997)). All 
of the quantities we consider are invariant to the labeling of the clusters. See Jasra et al. 
(2005) for discussion on this point. 

The sampler was run for 5,000 burn-in iterations, and 45,000 sample iterations. This 
appeared to be more than sufficient for mixing. Cluster sizes were recorded after each 
iteration, and the full state of the chain was recorded every 50 iterations. Each iteration 
took approximately 8 x 1CT 6 n seconds, with n = 82. 

Results. Figure 5 shows the estimated densities and the posteriors on the number of clusters 
and components. Comparing this with Figure 2(c) of Richardson and Green (1997), we see 
that our MFM density estimate is visually indistinguishable from theirs (as it should be, 
since we are using the same model with the same parameters). 

Table 1 compares our estimate of the MFM posterior on the number of components k with 
the results of Richardson and Green (1997). Again, the results are very close, as expected. 

7.3. Discriminating cancer types using gene expression data. In cancer research, 
gene expression profiling—that is, measuring the degree to which each gene is expressed by 
a given tissue sample under given conditions—enables the identification of distinct subtypes 
of cancer, leading to greater understanding of the mechanisms underlying cancers as well as 
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Figure 5. Results on the galaxy dataset. Left: Histogram of the data (green 
bars), rug plot of the data (black ticks), and estimated densities using the 
MFM (red solid line) and DPM (bine dashed line). Right: MFM and DPM 
posteriors on the number of clusters (t), along with the MFM posterior on the 
number of components (k). 


Table f. Estimate of the MFM posterior on k for the galaxy dataset. 


k 

1 

2 

3 

4 

5 

6 

7 

Here 

R&G 

0.000 

0.000 

0.000 

0.000 

0.065 

0.061 

0.143 

0.128 

0.191 

0.182 

0.191 

0.199 

0.153 

0.160 


8 

9 

10 

11 

12 

13 

14 

15 

0.106 

0.109 

0.066 

0.071 

0.039 

0.040 

0.021 

0.023 

0.012 

0.013 

0.006 

0.006 

0.003 

0.003 

0.002 

0.002 


potentially providing patient-specific diagnostic tools. In gene expression datasets, there are 
typically a small number of very high-dimensional data points, each consisting of the gene 
expression levels in a given tissue sample under given conditions. 

One approach to analyzing gene expression data is to use Gaussian mixture models to 
identify clusters which may represent distinct cancer subtypes (Yeung et ah, 2001; McLachlan 
et ah, 2002; Medvedovic and Sivaganesan, 2002; Medvedovic et ah, 2004; de Souto et ah, 
2008; Rasmussen et ah, 2009; McNicholas and Murphy, 2010). In fact, in a comparative 
study of seven clustering methods on 35 cancer gene expression datasets with known ground 
truth, de Souto et ah (2008) found that finite mixtures of Gaussians provided the best 
results—when the number of components k was set to the true value. However, in practice, 
choosing an appropriate value of k can be difficult. Using the methods developed in this 
paper, the MFM provides a principled approach to inferring the clusters even when k is 
unknown, as well as doing inference for k, provided that the components are well-modeled 
by Gaussians. (However, see Section 8 for some potential pitfalls.) 

The purpose of this example is to demonstrate that our approach can work well even in 
very high-dimensional settings, and may provide a useful tool for this application. It should 
be emphasized that we are not cancer scientists, so the results reported here should not be 
interpreted as scientifically relevant, but simply as a proof-of-concept. 
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Data. We apply the MFM to gene expression data collected by Armstrong et al. (2001) 
in a study of leukemia subtypes. Armstrong et al. (2001) measured gene expression levels 
in samples from 72 patients who were known to have one of two leukemia types, acute 
lymphoblastic leukemia (ALL) or acute myelogenous leukemia (AML), and they found that 
a previously undistinguished subtype of ALL, which they termed mixed-lineage leukemia 
(MLL), could be distinguished from conventional ALL and AML based on the gene expression 
profiles. 

We use the preprocessed data provided by de Souto et al. (2008), which they filtered to 
include only genes with expression levels differing by at least 3-fold in at least 30 samples, 
relative to their mean expression level across all samples. The resulting dataset consists of 72 
samples and 1081 genes per sample, i.e., n — 72 and d = 1081. Following standard practice, 
we take the base-2 logarithm of the data before analysis, and normalize each dimension to 
have zero mean and unit variance. 

Model. For simplicity, we use multivariate Gaussian component densities with diagonal co- 
variance matrices, i.e., the dimensions are independent univariate Gaussians, and we place 
independent conjugate priors on each dimension. Thus, for each component, for % — 1,. .., d, 
dimension i is J\f Xf 1 ), with A; ~ Gamma(a, b) and nfXi ~ A/”(0, (cAj) -1 ). We choose 
a = 1, b = 1, and c = 1. (Recall that the data is zero mean, unit variance in each dimension.) 
For the MFM, K ~ Geometric(O.l) and 7 = 1, and for the DPM, a ~ Exponential(l). These 
are all simply default settings and have not been tailored to the problem; a careful scien¬ 
tific investigation would involve thorough prior elicitation, sensitivity analysis, and model 
checking. 

Inference. Given the partition C of the data into clusters, the parameters can be integrated 
out analytically since the prior is conjugate. Thus, for both the MFM and DPM, we use the 
split-merge sampler of Jain and Neal (2004) for conjugate priors, coupled with Algorithm 3 
of Neal (2000). Following Jain and Neal (2004), we use the (5,1,1) scheme: 5 intermediate 
scans to reach the split, launch state, 1 split-merge move per iteration, and 1 incremental 
Gibbs scan per iteration. 

Due to the high-dimensionality of the parameters, this has far better mixing time than 
sampling the parameters, as is done in reversible jump MCMC. 

The sampler was run for 1,000 burn-in iterations, and 19,000 sample iterations. This 
appears to be many more iterations than required for burn-in and mixing in this particular 
example—in fact, only 5 to 10 iterations are required to separate the clusters, and the results 
are indistinguishable when using only 10 burn-in and 190 sample iterations. The full state of 
the chain was recorded every 20 iterations. Each iteration took approximately 1.3 x lCT 3 n 
seconds, with n = 72. 

Results. The posteriors on the number of clusters t are concentrated at 3 (see Figure 6), 
in agreement with the division into ALL, MLL, and AML determined by Armstrong et al. 
(2001). The MFM posterior on k is shifted slightly to the right because there are a small 
number of observations; this accounts for uncertainty regarding the possibility of additional 
components that were not observed in the given data. 

Figure 6 also shows the MFM pairwise probability matrix, that is, the matrix in which 
entry (i, j) is the posterior probability that data points i and j belong to the same cluster; 
in the figure, white is probability 0, black is probability 1. (The DPM matrix, not shown, is 
indistinguishable from the MFM matrix.) The rows and columns of the matrix are ordered 
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Figure 6. Results on the leukemia gene expression dataset. Left: Posteriors 
on the number of clusters and components. Right: MFM pairwise probability 
matrix (the DPM matrix is the same). See the text for discussion. 


according to ground truth, such that 1-24 are ALL, 25-44 are MLL, and 45-72 are AML. 
The model has clearly separated the subjects into these three groups, with a small number 
of exceptions: subject 41 is clustered with the AML subjects instead of MLL, subject 52 
with the MLL subjects instead of AML, and subject 17 is about 50% ALL and 50% MLL. 

8. Discussion 

Due to the fact that inference for the number of components is a topic of high interest 
in many research communities, it seems prudent to make some cautionary remarks in this 
regard. Many approaches have been proposed for estimating the number of components 
(Henna, 1985; Keribin, 2000; Leroux, 1992; Ishwaran et ah, 2001; James et ah, 2001; Henna, 
2005; Woo and Sriram, 2006, 2007). In theory, the MFM model provides a Bayesian approach 
to consistently estimating the number of components, making it a potentially attractive 
method of assessing the heterogeneity of the data. However, there are several possible 
pitfalls to consider, some of which are more obvious than others. 

An obvious potential issue is that in many applications, the clusters which one wishes to 
distinguish are purely notional (for example, perhaps, clusters of images or documents), and a 
mixture model is used for practical purposes, rather than because the data is actually thought 
to arise from a mixture. Clearly, in such cases, inference for the “true” number of components 
is meaningless. On the other hand, in some applications, the data definitely comes from a 
mixture (for example, extracellular recordings of multiple neurons)—so there is in reality a 
true number of components—however, usually the form of the mixture components is far 
from clear. 

More subtle issues are that the posteriors on k and t can be 

(1) strongly affected by the base measure H, and 

(2) sensitive to misspecification of the family of component distributions {fe}- 

Issue (1) can be seen, for instance, in the case of normal mixtures: it might seem desirable 
to choose the prior on the component means to have large variance in order to be less 
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informative, however, this causes the posteriors on k and t to favor smaller values (Richardson 
and Green, 1997; Stephens, 2000; Jasra et ah, 2005). The basic mechanism at play here is the 
same as in the Bartlett-Lindley paradox, and shows up in many Bayesian model selection 
problems. With some care, this issue can be dealt with by varying the base measure H 
and observing the effect on the posterior—that is, by performing a sensitivity analysis—for 
instance, see Richardson and Green (1997). 

Issue (2) is more serious—in practice, we typically cannot expect our choice of {fg : 9 E 0} 
to contain the true component densities (assuming the data is even from a mixture). When 
the model is misspecihed in this way, the posteriors of k and t can be severely affected and 
depend strongly on n. For instance, if the model uses mixtures of Gaussians, and the true 
data distribution is not a finite mixture of Gaussians, then these posteriors can be expected 
to diverge to infinity as n increases. Consequently, the effects of misspecihcation need to be 
carefully considered if these posteriors are to be used as measures of heterogeneity. Steps 
toward addressing the issue of robustness have been taken by Woo and Sriram (2006, 2007) 
and Rodriguez and Walker (2014), however, this is an important problem demanding further 
study. 

Despite these issues, sample clusterings and estimates of the number of components or 
clusters can provide a useful tool for exploring complex datasets, particularly in the case of 
high-dimensional data that cannot easily be visualized. It should always be borne in mind, 
though, that the results can be interpreted as being correct only to the extent that the model 
assumptions are correct. 
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Appendix A. Formulas for some posterior quantities 

Below are some details regarding computation of the posterior on k and of the density 
estimates. 


Posterior on the number of components k. The posterior on t — \C\ is easily estimated from 
posterior samples of C. To compute the MFM posterior on k, note that 

oo n 

p(k\x l:n ) = y^ j p{k\t,x 1:n )p(t\xx :n ) = y^p(k\t)p(t\x 1:n ), 

t =1 t =1 

by Equation 3.8 and the fact that t cannot exceed n. Using this and the formula for p(k\t) 
given by Equation 3.6, it is simple to transform our estimate of the posterior on t into an 
estimate of the posterior on k. For the DPM, the posterior on the number of components k 
is always trivially a point mass at infinity. 


Density estimates. Using the restaurant process (Theorem 4.1), it is straightforward to show 
that if C is a partition of [n] and 0 = (0 C : c E C) then 


(A.l) 


p(x n +1 | C,f,x 1:n ) oc 


kn+l (t + 1) 

V n+ 1 (t) 


^m{x n+l ) + ^(|c| + i)U e {x n +i) 


ceC 
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where t = |C|, and, using the recursion for V n (t ) (Equation 3.9), this is normalized when 
multiplied by V n+ \{t)/V n (t). Further, 

(A.2) p(x n+ 1 | C,x 1:n ) oc Vn ^ t ^ 1 \ m(x n+ 1) + J^(|c| + 7 ) m (^u { »+ 1} ) ; 

Gt+ilG cgC m{x c ) 

with the same normalization constant. Therefore, when the single-cluster marginals m(x c ) 
can be easily computed, Equation A.2 can be used to estimate the posterior predictive 
density p(x n+ i\xi :n ) based on samples from C \ x\ :n . When m(x c ) cannot be easily computed, 
Equation A.l can be used to estimate pfx n+ i\xi- n ) based on samples from C, 0 | xi :n , along 

with samples 9 \,..., On ~ H to approximate m( x n +i) ~ A feX x n+ 1)- (Thanks to Steve 
MacEachern for pointing out how to handle m(x n+ 1) here.) 

The posterior predictive density is, perhaps, the most natural estimate of the density. 
However, following Green and Richardson (2001), a simpler way to obtain a natural estimate 
is by assuming that element n + 1 is added to an existing cluster; this will be very similar to 
the posterior predictive density when n is sufficiently large. To this end, we define p*(x n+ 1 | 
C,(j),xi- n ) = p(x n+ 1 | C, 4>, xi :n , |C n+ i| = |C|), where C n+ 1 is the partition of [n + 1], and 
observe that 

P*(x n + 1 | C, 0, X\:n) '/’ ' ' 7 f<p c { x n+ 1) 

^ n + jt 
cec ' 

where t = \C\ (Green and Richardson, 2001). Using this, we can estimate the density by 

1 N 

(A.3) —J2p*(x n +1 I C W ,0 W ,Z 1: n), 

i =1 

where (C^ 1 ), ^W),..., (C^ N \ <j>^) are samples from C, 0 | x\- n . The corresponding expressions 
for the DPM are all very similar, using its restaurant process instead. The density estimates 
shown in this paper are obtained using this approach. 

These formulas are conditional on additional parameters such as 7 for the MFM, and a 
for the DPM. If priors are placed on such parameters and they are sampled along with C and 
0 given Xi :n , then the posterior predictive density can be estimated using the same formulas 
as above, but also using the posterior samples of these additional parameters. 


Appendix B. Proofs 

Proof of Theorem 3.1. Letting 10 = {j : Zj = i}, and writing C(z ) for the partition induced 
by z = (z 1 ,, z n ), by Dirichlet-multinomial conjugacy we have 


J p(z\ir)p(Tr\k)diT 


£ (^) n 

r(y) fc r(n + /c7) 


1 

(fcy)W 


n 7 <ici) 

ceC(z) 


p(z\k) 
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for z G [k] n , provided that px{k) > 0. It follows that for any partition C of [n], 


(B.l) 


p{C\k) = E P{A k ) 

zG[fc] n : C(z)=C 

= #h 6 W" : C(z) = C 


nt N> , 

cec 




n> 

ceC 


(Id) 


kt) 


(yfc)( n ) 


where t = \C\, since G [k} n : C(z ) = C} = (^)t! = k( t y Finally, 

00 0° 7 

p(C) = J2 p ( C \k)p K (k) = (n7 (|c|) )E f k\(n) PK ^ = ^wn7 (|c|) > 

h =1 cGC fc=l ' ' cGC 

with V n (t ) as in Equation 3.2. 


□ 


Proof of Equation 3.3. Theorem 3.1 shows that the distribution of C is as shown. Next, 
note that instead of sampling only 0i,... , 6 k ~ H given K = k , we could simply sample 
61 , 62 ,... ~ H independently of K, and the distribution of X 1:n would be the same. Now, 
Z\ :n determines which subset of the i.i.d. variables 0i,0 2 ,... will actually be used, and the 
indices of this subset are independent of 61 , 62 , •••; hence, denoting these random indices 
I\ < ■ ■ ■ < It, we have that 0j x ,..., 0i T \Z\ :n are i.i.d. from H. For c G C, let (j) c = 6 j % where 
i is such that c = {j : Zj = /,}. This completes the proof. □ 


Proof of the properties in Section 3.1. Abbreviate x = X\, n , z = z 1:n , and 0 = 0 1 : k, and 
assume p(z, k) > 0. Letting E % = {j : Zj = i}, we have p(x\ 6 , z, k) = n!Li Wj&E, fdi( x j ) an d 

L 

= \\mfxEi) = m(x c ). 

1=1 ceC(z) 


II fOi(xij\H(de. 


j&Ei 


p(x\z,k )= / p{x\ 6 ,z,k)p{dd\k) = W 


1 © fc 


Since this last expression depends only on z,k through C = C(z), we have p(x\C) = 
n ceC m(* c ), establishing Equation 3.4. Next, recall that p(C\k) = ^l n) ricec(where 
t = |C|) from Equation B.l, and thus 

pm= E Mk) = jAkh ^ 

C:|C|=t W ' C:\C\=t. cEC 

(where the sum is over partitions C of [n] such that \C\ — t) establishing Equation 3.5. 
Equation 3.6 follows, since 

p(k\t) oc p(t\k)p(k) oc p K (k), 
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(provided p(t) > 0) and the normalizing constant is precisely V n (t). To see that C T K \ T 
(Equation 3.7), note that if t = \C\ then 


p(C\t,k) 


p(C,t\k) 
p(t\k) 


p(C\k) 

p{t\k) ’ 


(provided p(t, k ) > 0) and due to the form of p(C\k) and p(t\k) just above, this quantity does 
not depend on k ; hence, p(C\t, k) = p(C\t). To see that X T K \ T (Equation 3.8), note that 
X T K | C] using this in addition to C T K \ T , we have 


p(x\t,k) = Y p(x\C,t,k)p(C\t,k)= Y =P(x\t). 

C:\C\=t C:\C\=t 


□ 


Proof of Theorem f.l. Let be the random partition of Z>o as in Section 3.3, and for 
n E {1,2,...}, let C n be the partition of [n] induced by C^. Then 

p(Cn\Cn—li ■ ■ • i Cl) p(C n \Cn—l) OC q n (C n ) I(Cn \ Tl C n — i), 


where C\n denotes C with element n removed, and /(•) is the indicator function (1(E) = 1 
if E is true, and 1(E) = 0 otherwise). Recalling that q n (C n ) = V n (\C n \) J{ cgC (Equation 
3.1), we have, letting t = |C n _i|, 


(r lr \ / V n (t + 1)7 if n is a singleton in C n , i.e., {n} G C n 

p{ n n- 1 ) OC | V n (t)(j + |c|) if c E C n _i and c U {?r} G C n , 

for C n such that C n \ n = C n -\ (and p(C n \C n -i) = 0 otherwise). With probability 1, 
g n _i(C n _i) > 0, thus V n -i(t) > 0 and hence also V n (t) > 0, so we can divide through 
by V n (t) to get the result. □ 


Proof of Theorem f .2. Let G ~ Xi(p K ,'y, H) and let f3i,..., f3 n ~ G, given G. Then the 
joint distribution of (f3i ,..., f3 n ) (with G marginalized out) is the same as (6z { ,..., Qz n ) in 
the original model (Equation 2.1). Let C n denote the partition induced by Zi,...,Z n as 
usual, and for c G C n , define 0 C = 6j where I is such that c = {j : Zj = /}; then, as in the 
proof of Equation 3.3, (cp c : c G C n ) are i.i.d. from H , given C n . 

Therefore, we have the following equivalent construction for (/3 1; ... ,/3 n ): 

C n ~ q n , with q n as in Section 3.3 

(j) c ~ H for c G C n , given C n 

(3j = (j) c for j Ec, c e C n , given C n , 0. 

Due to the self-consistency property of q \, q- 2 ,... (Proposition 3.3), we can sample C n , (<f c : 
c G C n ), /3i :n sequentially for n — 1, 2,... by sampling from the restaurant process for C n \C n -i, 
sampling from H if n is placed in a cluster by itself (or setting 0 cU {n} = 4> c if n is added 
to c E C n _ 1), and setting /3 n accordingly. 

In particular, if the base measure H is continuous, then the 0’s are distinct with probability 
1, so conditioning on 0i :n _i is the same as conditioning on C n - 1, (0 C : c E C n _i), 0i :n -i, and 
hence we can sample 0 n |/9i :n -i in the same way as was just described. I11 view of the form 
of the restaurant process (Theorem 4.1), the result follows. □ 


We use the following elementary result in the proof of Theorem 5.1; it is a special case of 
the dominated convergence theorem. 
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Proposition B.l. For j = 1,2,, let a\ 3 > a, 2 j > • • • > 0 such that a %3 -A 0 as i -A oo. If 
aij < oo then Y^jLi a ij ^ 0 as i ^ oo. 


Proof of Theorem 5.1. For any £ > 0, writing xf n ' > jn\ = r(x + n)/{n\ r(a;)) and using Stir¬ 
ling’s approximation, we have 

x (n) fi x ~^ 

n\ r(x) 

as n —y oo. Therefore, the k — t term of V n (t) is 


Ht) 


PK(t) 


r^i 


r(7t) 

n\ rt J* -1 


p A -(t). 


The hrst t — 1 terms of V n (t) are 0, so to prove the result, we need to show that the rest of 
the series, divided by the k — t term, goes to 0. (Recall that we have assumed pic(t) > 0.) 
To this end, let 

Kk = ^ (n) J^H PK ^- 

We must show that Ylh=t +i ^ nk _■ y 0 as n —> oo. We apply Proposition B.l with a %3 = b t + l t+3 . 
For any k > t, b\ k > b 2k > • • • > 0. Further, for any k > t, 

( i yt)^ n 1 *- 1 r(y k) 

( 7 k)G) r( 7 t) n^ k ~ l 

as n —> 00 , hence, b nk —* 0 as n —* 00 (for any k > t). Finally, observe that Y^k=t+i bnk < 
( 7 t)^V n (t) < 00 for any n > t. Therefore, by Proposition B.l, YlkLt+i b nk —* 0 as n —* 00 . 
This proves the result. □ 


Proof of Theorem 5.2. For any t 6 {1,..., fc}, 

(B.2) = = = 

as n —00 (where p n denotes the MFM distribution with n samples), by Equation 3.6 and 
Theorem 5.1. For any n > k, 

k 

p{K = k I x 1:n ) = y ^p{K = k\T = t,x 1:n )p(T = t \ x 1:n ), 

t =1 

and note that by Equations 3.8 and B.2, p(K — k\T — t,x i :n ) = p n {K — k | T — t) —> 
I(k — t) for t < k. The result follows. □ 
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